Contributors:

William Loving (wfl9zy), James Sweat (jes9hd)

Goals:

  1. Explore and visualize the broader Computer Science / Data Science industry fields.
  2. Discover interesting correlations between attributes of available jobs across multiple datasets.
  3. Learn how to develop meaningful visualizations to communicate the data we have to an uninformed audience.

Part 1: Data Science Positions:

  • Here we will explore Data Scientist jobs in and around the United States.
  • Our main goal is to visualize which jobs pay the most under different factors, and to check whether any correlations or patterns emerge.

Load Data:

library(tidyverse)  # provides read_csv() (readr) plus %>% and mutate() (dplyr)

data <- read_csv("../data/data-science-jobs/ds_salaries.csv")
head(data)
## # A tibble: 6 × 12
##    ...1 work_year experience_level employment_type job_title              salary
##   <dbl>     <dbl> <chr>            <chr>           <chr>                   <dbl>
## 1     0      2020 MI               FT              Data Scientist          70000
## 2     1      2020 SE               FT              Machine Learning Scie… 260000
## 3     2      2020 SE               FT              Big Data Engineer       85000
## 4     3      2020 MI               FT              Product Data Analyst    20000
## 5     4      2020 SE               FT              Machine Learning Engi… 150000
## 6     5      2020 EN               FT              Data Analyst            72000
## # ℹ 6 more variables: salary_currency <chr>, salary_in_usd <dbl>,
## #   employee_residence <chr>, remote_ratio <dbl>, company_location <chr>,
## #   company_size <chr>

Explore Data and Make Necessary Transformations:

str(data)
## spc_tbl_ [607 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ...1              : num [1:607] 0 1 2 3 4 5 6 7 8 9 ...
##  $ work_year         : num [1:607] 2020 2020 2020 2020 2020 2020 2020 2020 2020 2020 ...
##  $ experience_level  : chr [1:607] "MI" "SE" "SE" "MI" ...
##  $ employment_type   : chr [1:607] "FT" "FT" "FT" "FT" ...
##  $ job_title         : chr [1:607] "Data Scientist" "Machine Learning Scientist" "Big Data Engineer" "Product Data Analyst" ...
##  $ salary            : num [1:607] 70000 260000 85000 20000 150000 72000 190000 11000000 135000 125000 ...
##  $ salary_currency   : chr [1:607] "EUR" "USD" "GBP" "USD" ...
##  $ salary_in_usd     : num [1:607] 79833 260000 109024 20000 150000 ...
##  $ employee_residence: chr [1:607] "DE" "JP" "GB" "HN" ...
##  $ remote_ratio      : num [1:607] 0 0 50 0 50 100 100 50 100 50 ...
##  $ company_location  : chr [1:607] "DE" "JP" "GB" "HN" ...
##  $ company_size      : chr [1:607] "L" "S" "M" "S" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ...1 = col_double(),
##   ..   work_year = col_double(),
##   ..   experience_level = col_character(),
##   ..   employment_type = col_character(),
##   ..   job_title = col_character(),
##   ..   salary = col_double(),
##   ..   salary_currency = col_character(),
##   ..   salary_in_usd = col_double(),
##   ..   employee_residence = col_character(),
##   ..   remote_ratio = col_double(),
##   ..   company_location = col_character(),
##   ..   company_size = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(data)
##       ...1         work_year    experience_level   employment_type   
##  Min.   :  0.0   Min.   :2020   Length:607         Length:607        
##  1st Qu.:151.5   1st Qu.:2021   Class :character   Class :character  
##  Median :303.0   Median :2022   Mode  :character   Mode  :character  
##  Mean   :303.0   Mean   :2021                                        
##  3rd Qu.:454.5   3rd Qu.:2022                                        
##  Max.   :606.0   Max.   :2022                                        
##   job_title             salary         salary_currency    salary_in_usd   
##  Length:607         Min.   :    4000   Length:607         Min.   :  2859  
##  Class :character   1st Qu.:   70000   Class :character   1st Qu.: 62726  
##  Mode  :character   Median :  115000   Mode  :character   Median :101570  
##                     Mean   :  324000                      Mean   :112298  
##                     3rd Qu.:  165000                      3rd Qu.:150000  
##                     Max.   :30400000                      Max.   :600000  
##  employee_residence  remote_ratio    company_location   company_size      
##  Length:607         Min.   :  0.00   Length:607         Length:607        
##  Class :character   1st Qu.: 50.00   Class :character   Class :character  
##  Mode  :character   Median :100.00   Mode  :character   Mode  :character  
##                     Mean   : 70.92                                        
##                     3rd Qu.:100.00                                        
##                     Max.   :100.00
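
The summary shows the raw salary column peaking at 30,400,000 with a mean of 324,000, while salary_in_usd peaks at 600,000. A likely explanation, given the salary_currency column, is that salary is recorded in each row's local currency, which would make salary_in_usd the safer column for comparisons. The sketch below is one way to check this; the helper name salary_range_by_currency is hypothetical and not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper (not from the original report): summarise the raw
# `salary` column per currency to see which currencies produce the
# very large values.
salary_range_by_currency <- function(df) {
  df %>%
    group_by(salary_currency) %>%
    summarise(
      rows = n(),                          # number of rows in this currency
      max_salary = max(salary),            # largest raw (local-currency) salary
      max_salary_usd = max(salary_in_usd), # largest salary after USD conversion
      .groups = "drop"
    ) %>%
    arrange(desc(max_salary))
}
```

Calling salary_range_by_currency(data) would list the currencies with the largest raw values first, next to their USD equivalents.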

Make Some Transformations for Ease of Understanding:

# Recode the abbreviated category codes into readable labels.
# Values that match no code are kept unchanged.
data_transformed <- data %>%
  mutate(
    # "MI" denotes mid-level (intermediate) experience in this dataset.
    experience_level = case_when(
      experience_level == "EN" ~ "Entry-Level",
      experience_level == "MI" ~ "Mid-Level",
      experience_level == "SE" ~ "Senior-Level",
      experience_level == "EX" ~ "Executive-Level",
      TRUE ~ experience_level
    ),
    employment_type = case_when(
      employment_type == "CT" ~ "Contract-Work",
      employment_type == "FT" ~ "Full-Time",
      employment_type == "PT" ~ "Part-Time",
      employment_type == "FL" ~ "Freelance",
      TRUE ~ employment_type
    ),
    company_size = case_when(
      company_size == "L" ~ "Large",
      company_size == "M" ~ "Medium",
      company_size == "S" ~ "Small",
      TRUE ~ company_size
    ),
    # remote_ratio is numeric, so the fallback is converted to character
    # to keep every branch the same type.
    remote_ratio = case_when(
      remote_ratio == 0 ~ "In-Person",
      remote_ratio == 50 ~ "Hybrid",
      remote_ratio == 100 ~ "Remote",
      TRUE ~ as.character(remote_ratio)
    )
  )

head(data_transformed)
## # A tibble: 6 × 12
##    ...1 work_year experience_level employment_type job_title              salary
##   <dbl>     <dbl> <chr>            <chr>           <chr>                   <dbl>
## 1     0      2020 Mid-Level        Full-Time       Data Scientist          70000
## 2     1      2020 Senior-Level     Full-Time       Machine Learning Scie… 260000
## 3     2      2020 Senior-Level     Full-Time       Big Data Engineer       85000
## 4     3      2020 Mid-Level        Full-Time       Product Data Analyst    20000
## 5     4      2020 Senior-Level     Full-Time       Machine Learning Engi… 150000
## 6     5      2020 Entry-Level      Full-Time       Data Analyst            72000
## # ℹ 6 more variables: salary_currency <chr>, salary_in_usd <dbl>,
## #   employee_residence <chr>, remote_ratio <chr>, company_location <chr>,
## #   company_size <chr>

Create Plots:

Plot 1:

  • This plot shows that salary rises with experience level.
  • Different types of work also behave differently: for example, contract-work salaries are much more volatile than full-time salaries.
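
A plot along these lines could be drawn as in the minimal sketch below. It assumes ggplot2 and the columns experience_level, employment_type and salary_in_usd; the helper name plot_salary_by_experience is hypothetical, and this is not necessarily the code behind the original figure. A boxplot is used here because it shows spread as well as level, which is what the volatility comparison relies on.

```r
library(ggplot2)

# Hypothetical helper (not the report's original plotting code).
# Expects columns: experience_level, employment_type, salary_in_usd.
plot_salary_by_experience <- function(df) {
  ggplot(df, aes(
    # Order experience levels by median salary so the trend reads left to right,
    # whatever labels the levels carry.
    x = reorder(experience_level, salary_in_usd, FUN = median),
    y = salary_in_usd,
    fill = employment_type
  )) +
    geom_boxplot() +
    scale_y_continuous(labels = scales::dollar) +
    labs(
      x = "Experience level",
      y = "Salary (USD)",
      fill = "Employment type"
    )
}
```

Calling plot_salary_by_experience(data_transformed) returns a ggplot object that can be printed or saved with ggsave().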

Plot 2:

  • In-person roles paid the most only at medium-sized companies; at large companies, remote roles paid the most.
  • At small companies, pay rises step-wise with the work arrangement (Hybrid -> In-Person -> Remote).
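
A comparison like this can be sketched as a grouped bar chart. The code below assumes the mean of salary_in_usd is the statistic being compared (the median would be an equally reasonable choice); the helper names are hypothetical and not taken from the original analysis.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: average USD salary for each
# company size / remote ratio combination.
mean_salary_by_size_and_remote <- function(df) {
  df %>%
    # Treat remote_ratio as categorical whether it is numeric or already labelled.
    mutate(remote_ratio = as.factor(remote_ratio)) %>%
    group_by(company_size, remote_ratio) %>%
    summarise(
      mean_usd = mean(salary_in_usd),
      rows = n(),  # group size, useful for spotting thinly populated groups
      .groups = "drop"
    )
}

# Hypothetical helper: one bar per remote ratio, grouped by company size.
plot_salary_by_size_and_remote <- function(df) {
  ggplot(
    mean_salary_by_size_and_remote(df),
    aes(x = company_size, y = mean_usd, fill = remote_ratio)
  ) +
    geom_col(position = "dodge") +
    scale_y_continuous(labels = scales::dollar) +
    labs(x = "Company size", y = "Mean salary (USD)", fill = "Remote ratio")
}
```

Inspecting the rows column of the summary before reading the chart helps show whether any bar rests on only a handful of postings.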

Plot 3:

  • This plot carries a lot of information, but the most interesting takeaway is that the US has the highest-paying jobs by far, with small companies in Japan a close second.
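
One way to rank locations like this is sketched below. It assumes the ranking is by mean salary_in_usd per company_location and company_size pair; the helper names and the top argument are hypothetical, and the original figure may have been built differently.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: the `top` highest-paying location / size groups,
# ranked by mean USD salary.
top_paying_groups <- function(df, top = 10) {
  df %>%
    group_by(company_location, company_size) %>%
    summarise(
      mean_usd = mean(salary_in_usd),
      postings = n(),  # how many rows back each average
      .groups = "drop"
    ) %>%
    slice_max(mean_usd, n = top, with_ties = FALSE)
}

# Hypothetical helper: horizontal bars, highest-paying group at the top.
plot_top_paying_groups <- function(df, top = 10) {
  top_paying_groups(df, top) %>%
    mutate(group = paste(company_location, company_size, sep = " / ")) %>%
    ggplot(aes(x = reorder(group, mean_usd), y = mean_usd, fill = company_size)) +
    geom_col() +
    coord_flip() +
    scale_y_continuous(labels = scales::dollar) +
    labs(x = "Company location / size", y = "Mean salary (USD)", fill = "Company size")
}
```

The postings column is worth checking alongside the ranking, since a group backed by very few rows can rank highly on the strength of a single salary.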

Part 1 Closing Remarks:

  • Part 1 examined the data science job market, looking at how salary relates to company size, remote ratio, and the experience level required for a position. Part 2 moves on to visualizations of software engineering jobs in India.